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Abstract 

This paper deals with parameter estimation in pair hidden Markov models (pair- 
HMMs). We first provide a rigorous formalism for these models and discuss possible 
definitions of likelihoods. The model being biologically motivated, some restrictions 
with respect to the full parameter space naturally occur. Existence of two different 
Information divergence rates is established and divergence property (namely positivity 
at values different from the true one) is shown under additional assumptions. This yields 
consistency for the parameter in parametrization schemes for which the divergence 
property holds. Simulations illustrate different cases which are not covered by our 
results. 

Key words and phrases: Pair-HMM, pair hidden Markov models, sequence alignment, score 
parameters estimation, TKF evolution model. 



1 Introduction 

1.1 Background 

Sequence alignment has become one of the most powerful tools in bioinformatics. Biological 
sequences are aligned for instance (and among many other examples) to infer gene functions, 
to construct or use protein databases or to construct phylogenetic trees. Concerning this 
last topic, current methods first align the sequences and then infer the phylogeny given 
this fixed alignment. This approach contains a major flaw since the two problems are 
largely intertwined. Indeed, the alignment problem consists in retrieving the places, in 
the observed sequences, where substitution/deletion/insertion events have occurred, due 
to the evolution process. In the pair alignment problem, the observations consist in a 
couple of sequences X\ :n = X± . . . X n and Y\- m = Y± . . . Y m with values on a finite state 
alphabet A (A = {A, C, G, T} for DNA sequences). It is assumed that the sequences share 
a common ancestor. According to biological evolution, the sequence of the ancestor evolves 
and letters in each site may change (substitution event), or be deleted (deletion event), or 
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new letters may be inserted in the sequence (insertion event). This process finally leads to 
the two different observed sequences. A most convenient way of displaying alignments is a 
graphical representation as a path through a rectangular grid (see Figure ^) . A diagonal 
move corresponds to a match between the two sequences, whereas horizontal and vertical 
moves correspond to insertion-deletion events. This path consists of steps £t, t = 1, . . . , I, 
where Et represents either a match (et = (1, 1)) or an insertion-deletion event {et = (1,0) 
or (0, 1)). The length of the alignment is I, and satisfies 

n V m < I < n + m. (1) 

Here nVm denotes the maximum value between n and m. The multiple alignment problem 
is the same, except that one has to retrieve the places where substitution/deletion/insertion 
events have occurred on the basis of a set of (more than two) sequences. 
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Figure 1: Graphical representation of an alignment between two sequences X = AATG 

A A TG - 

and Y = CTGG. The displayed alignment is c - tgg . 

Aligning two sequences relies on the choice of a score optimization scheme (for instance, 
the Needleman-Wunsch algorithm (Nee dleman and Wunsch 1970|0 and therefore the ob- 
tained alignments depend on the score parameters. Choosing these score parameters in the 
most objective way appears as a crucial issue. Because evolution is the force that promotes 
divergence between biological sequences, it is desirable to consider biological alignment in 
the context of evolution. Now, given an evolution model, optimal choices of the score pa- 
rameters depend on the underlying unknown mutation rates and thus on the phylogeny to 
be inferred after the alignment. The existence of such a vicious circle explains the emer- 
gence of probabilistic models where optimal alignment and evolution parameters estimation 
are achieved at the same time. 

Relying on a pioneering work by Bishop and Thompson (1986 ), Thorne, Kishino and 
Felsenstein (|1991|) were the first to provide a maximum likelihood approach to the align- 
ment of a pair of DNA sequences based on a rigorous model of sequence evolution (referred 
to as the TKF model). This model has become quite classical nowadays. In this setup, 
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each site is independently hit by a substitution or deleted, and insertions occur between 
two sites or at both ends of the sequence. Each one of those events occurs at a specific 
rate. When a substitution or an insertion occurs, a new nucleotide is drawn randomly 
according to some probability distribution on the state space {A,C,G,T}. One of the 
advantages of the TKF model lies in its exact correspondence with a model containing 
a hidden Markov structure, ensuring the existence of powerful algorithmic tools based 
on dynamic programming methods. More precisely, the TKF evolution model fits into 
the concept of a pair hidden Markov model (pair-HMM), as first formally described in 
(Durbin, Eddy, Krogh, and Mitchison 1998). 



Observations in a pair-HMM are formed by a couple of sequences (the ones to be aligned) 
and the model assumes that the hidden (i.e. non observed) alignment sequence {st}t is 
a Markov chain that determines the probability distribution of the observations. Since 
the seminal paper (Thornc, Kishino, and Felsenstein 1991), an abundant literature aroused 
in which parameter estimation occurs in a pair-HMM. Thorne, Kishino and Felsenstein 
( 1992 ) slightly improved their original model to take into account insertion and deletion 
of entire fragments (and not only single nucleotides). The TKF model approaches have 
been further developed in ( |Hein, Wiuf, Knudsen, Moller, and Wibling 2000} IMetzler 2 003 
Knudsen and Miyamoto 2003| Miklos, Lunter, and Holmes 20 04), for instance. Let us also 



mention that pair-HMMs were recently combined with classical hidden Markov models 

(HMMs) for ah initio prediction of genes ( Meyer and Durbin 2002| Pachter, Alexander sson, and Cawley 200*21 

Hoboltr Tahd Jensen 2005)) . 



The main difference between pair-HMMs and classical HMMs lies in the observation of a 
pair of sequences instead of a single one. From a practical point of view, the two above mod- 
els are not very different and classical algorithms such as forward or Viterbi algorithms are 



still valid and efficient in the pair-HMM context (we refer to ( Durbin, Eddy, Krogh, and Mitchison 1 998) 
for a complete description of those techniques). Forward algorithm allows to compute the 
likelihood of the two observed sequences and thus, by means of a maximisation technique, to 
approximate the maximum likelihood estimator (MLE) of the parameters. Numerical max- 
imisation approaches are commonly used ( dThorne, Kishino, and Felsenstein TMT] )) but 
statistical approaches using the Expectation-Maximisation (EM) algorithm and its variants 
(Stochastic EM, Stochastic Approximation EM) have recently been explored (|Hofmes 2 005 
|Arribas Gil, Metzler, and Plouhinec 20051 ) • Viterbi algorithm is designed to reconstruct the 
most probable hidden path, thus giving the alignment. From a Bayesian point of view, it is 
also interesting to provide a posterior distribution for parameters and alignments. This can 
be done with MCMC procedures needing again the use of forward algorithm ( Metzler 2003 
Arribas Gil, Metzler, and Plouhinec 2005 ) . 



Nonetheless, from a theoretical point of view, pair-HMMs and classical HMMs are com- 
pletely different. In particular, up to our knowledge, there is no theoretical proofs that 
the maximum likelihood procedure nor the Bayesian estimation give consistent estima- 
tors of the pair-HMM parameters (though it is the case for instance for regular HMMs 
with finite state space, see (jBaum and Petrie 1966)) concerning MLE consistency; see also 
(|Oaliehe and Rosier 20021) for the convergence of the maximum a posteriori hidden path). 

This paper is thus concerned with statistical properties of parameter estimation proce- 
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dures in pair-HMMs. 



1.2 Roadmap 

In Section 2, the pair-HMM is described, together with some properties of the distribution 
of observed sequences. Then we state possible likelihood functions, to be compared with 
the criterion that is optimized in pair-HMM algorithms. We then interpret this last one as 
a likelihood function. 

To investigate consistency of estimators obtained by maximization, one has to understand 
the asymptotic behaviour of the criteria. We adopt the Information Theory terminology 
and call Information divergence rates the difference between the limiting values of the log- 
likelihoods at the (unknown) true parameter value and at another parameter value. Indeed, 
the general model described below may be interpreted as a channel transmitting the input 
X\- n with possible errors, insertions or deletions, leading to the output Y\ :m (see for in- 
stance ( Davey and MacKay 200l{ ILevenshtein 2001"]) on the topic of error correcting codes 



and also ()Csiszar and Korner 198TllCover and Thomas 19 91) for a general introduction to 
Information Theory). In this setting, Information divergence rates have a precise meaning 
(in terms of coding or transmission qualities). In a statistical setting such as ours, they are 
interpreted as divergences that should have a unique minimum at the true parameter value 
(divergence property). Section 3 is devoted to the existence and properties of such limit 
functions (see Theorems ^ and GJ) ■ 

Section 4, then, gives the statistical consequences in terms of consistent estimation of the 
parameters obtained via MLE or Bayesian estimation using pair-HMM algorithms (see The- 
orems 01 EJ. According to these results, consistency holds for the parameter in parametriza- 
tion schemes for which the divergence property holds for the associated Information diver- 
gence rate. 

In a last section, we present several simulation results to investigate situations in which the 
divergence property is not established. We illustrate the consistency results in cases where 
Theorem El applies, as may be seen on numerical computations of information divergence 
rates. We also compare the limiting values of different criteria and give some interpreta- 
tions. Unfortunately, despite the positive results that we obtain we are not yet in terms of 
completely validate pair-HMM algorithms in every situation. 



2 The pair hidden Markov model 
2.1 Model description 

We now describe in details the pair-HMM. Consider a stationary ergodic Markov chain 
{£t}t>i on the state space £ = {(1, 0); (0, 1); (1, 1)}, with transition matrix ir and stationary 
distribution [i = (p,q,r). This chain generates a random walk {Zt}t>o with values in the 
two-dimensional integer lattice N x N, by letting Zq = (0,0) and Z t = Yli< s <t £ s- The 
coordinate random variables corresponding to Z% at time t are denoted by (Nt,Mt) (i.e. 
Z t = (N t ,M t )). We shall either use the notation ir(e s ,e s+ i) to denote the transitions 
probabilities of the matrix n, or explicit symbols like tthv indicating a transition from 
state H = (1,0) to state V = (0, 1) (H stands for horizontal move, V for vertical move and 
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D = (1,1) for diagonal move). 

Conditional on the hidden random walk, the observations are drawn according to the 
following scheme. At time t, if e% = (1,0) then a random variable X is drawn (emitted) 
according to some probability distribution / on A, if £t = (0, 1) then a random variable 
Y is drawn (emitted) according to some probability distribution g on A and finally, if 
Ef = (1,1) then a couple of random variables (X, Y) is drawn (emitted) according to some 
probability distribution h on A x A. Conditionally to the hidden Markov chain {et}t>i, 
all emitted random variables are independent. This model is described by the parameter 
9 = (it, f,g,h) 6 S. The conditional distribution of the observations thus writes 

We(.Xl:Nt,Yl:M t \£l:t: { £ s}s>t, {Xi,Yj}i^N ej j^M s ,0<s<t) = Fg(Xi : N t , Yl:M t l e l:i) 

= n/^j^^^^^^j^^^^M^,^) 11 ^^ 1 ' 1 ^, (2) 

s=l 

where H{-} stands for the indicator function. Moreover, the complete distribution P# is 
given by 

t 

W s (£i :t ,X 1:Nt ,Y 1:Mt ) = £i(ei){ JJvr(e s _i,e s )|p (Xi :A r t ,yi : M t |ei : t)- 

s=2 

Here we denote by Fq (and Eg) the induced probability distribution (and corresponding 
expectation) on £ N x A N x A^ and #o the true parameter corresponding to the distribution 
of the observations (we shall abbreviate to Po an d Eo the probability distribution and 
expectation under parameter 8q). Note that a necessary condition for identifiability of the 
parameter 6 is that the occurrence probability of two aligned letters differs from the product 
probabilities of these letters. That is: 

Assumption 1 

3x,y G A, such that h(x,y) ^ f(x)g(y). 
Indeed, if h = fg, then (j2j) gives 

|n/(^)||n^)|=wi 

Thus, in this case, the distribution of the observations is independent from the hidden 
process and the parameter ir cannot be identified. In the following, we shall always work 
under Assumption^ 

2.2 Observations and likelihoods. 

Statisticians define log-likelihoods to be functions of the parameter, that are equal to the 
logarithm of the probability of the observations. Here, to state what log-likelihoods are, 
one has to decide what do the observed sequences (Xi :n ,Yi :m ) represent. Indeed, one may 
interpret it in at least two different ways: 
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(a) It is the observation of emitted sequences until some time t, so that the log-likelihood 
should be logPg(Xx : Ar t , ^i:Af t )- Here, the probability is that of the observed sequences 
and a point of the hidden process Zt = (Nt, Mt); 

(b) Each observed sequence is one of the emitted sequences X\ : N t for some t and Y\-m s 
for some s, knowing nothing on the hidden process (that is whether t = s, or t > s, or 
t < s), so that the log-likelihood should be log¥$(Xi- n ,Yi- m ). Here, the probability 
is the marginal distribution of the sequences. 

It should be now noted that none of those quantities is the one computed by pair-HMM 
algorithms. We will come back to this fact later. Note also that we imposed the true 
underlying alignment to pass through the fixed point (0, 0) (namely, we assumed Zq = (0, 0)) 
which is not the more general setup (and may introduce a bias in practical applications). 
However, we restrict our attention to this particular setup. 

First, we introduce some notations to make the previous quantities more precise. 
Let us consider the set of all the possible trajectories of the hidden path and the set 
£n,m of trajectories passing through the point (re, m): 

{(0, 1); (1, 0); (1, 1)} N = {e = (e 1; e 2 , ...)} = £ N , (3) 

I 

{e G {(0,l);(l,0);(l,l)}';n Vm < I < n + m; ^ a = (n,m)}. (4) 

i=l 

The length of any trajectory e E £ n ,m is denoted by |e|. Then, we have the following 
equations 

F e (X 1:n , Y hm ) = Pflforoo = e 1:oc ,X hn , Y hm ), (5) 

e££oo 

F e (X hNt ,Y hMt ) = P e (X hNt ,Y 1:Mt ,Z t ) = F e(ei-.t = e 1:t , X 1:Nt , Y hMt ). (6) 

e&£ Ntt M t ;\e\=t 

As Equation Q shows, if one uses the marginal distributions as likelihood, it means that 
when observing two sequences X\. n and Y\- m , it is not assumed that the hidden process 
passes through the observed point (n, m). This results in an alignment with not necessarily 
bounded length (see Figure |5J) • 

We shall now detail Equation (J5J) according to possible alignments. Among all the 
trajectories in £00, we shall distinguish the ones in £ n>m and the ones belonging to some set 
£n,p (with p > m) or £ P;m (with p > n). Those last ones need to be constrained in order to 
avoid multiple counting. Let us denote by £^m (resp. £^m) the restriction of the set £ n ^ m 
to trajectories not ending with an horizontal (resp. vertical) part. More precisely, 

e| e |_ s+ i = (1,0) then s = 0}, 
e| e |_ s+ i = (0,1) then s = 0}. 



f 
f 



£n,m = {e = (ei, . . . ,e| e |) G f ni . m ; If for some s, 

e |e| = e \e\-l = ■■■ = 

£n,m = {e = (ei, . . . ,e| e |) G f njm ; If for some s, 

e |e| = e |e|-l = • • • = 
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Figure 2: Graphical representation of an alignment of sequences X\- n and Y\- m not passing 
through the point (n, m). 

These notations allow to express the marginal distribution Pfl(Xi :n , Y\. m ) as a sum over 
three different path types. 

W0(Xl:n,Y 1:m ) = M£l:\e\=e,X 1:n ,Y 1:m ) 

eef„ 

+ ^ ^ Pe( £ l:|e| = e > -Xl:nj -X"n+l:p = Xn+l:p, *l:m) 

+ ^ ^ ^ IP6»( £ l:|e| = e,Xi :n ,Yi :m ,Y m+ i :p = y m+ i: P ). 
p>m ee£ -v y m+1:p 

This form may not be used for the computation of the marginal distribution Fe(Xi :n , Yi :m ). 

We now give some recursion formulas that could lead to practical implementations of this 
last quantity. For any state e G £ , define Pg as the distribution induced by Fq conditional 
on ei = e. Let us also denote by hx (resp. hy) the marginal with respect to the first (resp. 
second) coordinate of the distribution h. 

Lemma 1 For any n > 1, m > 1, 

We(Xi:n,Y 1:m ) =pP^ 0) (x 1:n ,y 1:m ) + g p(°' 1) (x 1:n ,y 1:m ) + rPi 1 ' 1) (x 1:n ,y 1:m ), (7) 



8 



A. Arribas-Gil, E. Gassiat and C. Matias 



with the following recursions 

F$'°\x 1:n ,Y 1:m ) = f{X l ){ir HH pf°\x 2 , n ,Y l .. m )+K HV pf 1) (X 2 , n ,Y l , m ) 

+7rHD^' 1 \x 2 .. n ,Y 1 .. m )} 

P5 0,1) (*l:„,yi: TO ) = 9(Y 1 ){7T VH F^ \X 1 .. n ,Y 2 .. m )+7T VV Pf' 1 \X 1 .. n ,Y 2 .. m ) 

+vr yD p( 1 ' 1) (X 1:n ,y 2:m )} 
F^iX^Y^) = h(X u Y l ){^ DH F { ^\x 2 .. n ,Y 2 .. m ) + n DV Ff l \x 2 , n ,Y 2 , m ) 

+KDDP i e' 1 \x 2:n ,Y 2 .. m )} 

and initializations: 

F^ 0) (X l ) = fiX,), f£' 1 \y 1 ) = 9 (Y 1 ), P^{X U Y,) = h(X u Y{), 
Pf 1} (X 1:n ) = —± {7T VH f(X l )F e lfi) {X 2 .. n ) + ttvd hx(X 1 )Ff' 1 \x 2:n )}, 

1 — Tlyy 

V { 0'°\Y 1:m ) = — — {tt hv g(Y 1 )Ff' 1 \Y 2:m ) + tt hd h Y (Y^F^ (Y 2:m )} . 

Proof of Lemma ^ is trivial and therefore omitted. 



Interpretation (a) leads to define the log-likelihood £t{9) as 

£ t (9) = ]og¥ (X ltNt ,Y ltMt ), t > 1. (8) 

But since the underlying process {Z t }t>o is not observed, the quantity £t{9) is not a mea- 
surable function of the observations. More precisely, the time t at which observation is 
made is not observed itself. Though, if one decides to use interpretation (a), namely that 
(Xi :n , Yi :m ) corresponds to the observation of the emitted sequences at a point of the hidden 
process Z t = (N t , M t ) and some unknown time t, one does not use £t(9) as a log-likelihood, 
but rather 

w t (9) = logQ e (X 1:iVt ,yi:Mj, t>l (9) 
where for any integers n and m 

Qe(X 1:n ,Y 1:m )=F e (3s >1,Z S = (n,m);X 1:n ,y 1:m ). (10) 

In other words, Qq is the probability of the observed sequences under the assumption that 
the underlying process \et\t>\ passes through the point (n,m). But the length of the 
hidden trajectory remains unknown when computing Qq. This gives the formula: 

Qe{X 1:n ,Y 1:m ) = We(si;\ e \=e,X 1:n ,Y 1:m ). (11) 

.77), 

Let us stress that we have 

w t (9) = logP e (3s >1,Z S = (Nt,Mt);X 1:Nt ,Y 1:Mt ), t > 1, 
meaning that the length of the trajectory is not necessarily t, but is in fact unknown. 
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Qg is the quantity that is computed by forward algorithm (see ( Durbin, Eddy, Krogh, and Mitchison 19 98)) 



and which is used as likelihood in biological applications. It is computed via recursive equa- 
tions similar to those of Lemma ^ I n practice, paths with highest scores according to the 
the Needleman-Wunsch scoring scheme exactly correspond to highest probability paths in a 
pair-HMM, with a corresponding choice of the parameters (( |Durbin, Eddy, Krogh, and Mitchison 19 98)), 
Thus, the quantity Qg is used for finding the best alignment between two sequences. More- 
over, as we explained it in the introduction, the idea of maximizing this quantity with re- 
spect to the parameter 6 has now widely spread among practitioners ( dThorne, Kishino, and Felsenstein 1991} 
Thorne, Kishino, and Felsenstein 1992||Hein, Wiuf, Knudsen, Moller, and Wibling 20001 IMetzler 20031 



Knudsen and Miyamoto 2003||Miklos, Lunter, and Holmes 20 04)). The goal is to obtain an 



objective choice of the parameters appearing in the scoring scheme, taking evolution into 
account. Thus, asymptotic properties of criterion Qg and consequences on asymptotic prop- 
erties of the estimator derived from Qg are of primarily interest. 

According to the relation asymptotic results for t — > oo will imply equivalent ones for 
n, m — > oo. In other words, consistency results obtained when t — > oo can be interpreted 
as valid for long enough observed sequences, even if one does not know t. 



2.3 Biologically motivated restrictions 

Evolution models are commonly chosen time reversible, in the limit of infinitely long se- 
quences. The reversibility property implies that the joint probability of sequence X and 
an ancestor sequence U is not influenced by the fact that X is a descendant of sequence 
U : this joint probability would be the same if X were an ancestor of U or if both were 
descendants of a third sequence. Note that this assumption does not apply on the level of 
alignments. Indeed, for single alignments, one may have P#(e = e, X, Y) ^ P#(e = e', Y, X), 
where e and e' are equal on diagonal steps and have switched insertions and deletions 
(namely, corresponding paths are symmetric around the axis x = y). In fact, it is the 
probability of a whole given set of evolution events (namely mutations, insertions or dele- 
tions occurring in the evolution process), which is a sum over different alignments e (all 
representing this same set of evolution events) of probabilities fg(e = e,X,Y), which is 
conserved if we interchange the two observed sequences. More precisely, we always have 
^ee£i = e ->X,Y) = See£ 2 ^M e = e ^,X) where £\ and £2 are alignments subsets 
representing the same set of evolution events. 

Evolution models rely on two separate processes: the insertion-deletion (indel) and the 
substitution process and both are supposed to be time reversible. As a consequence of time 
reversibility of indel process, the stationary probability of appearance of an insertion or of 
a deletion is the same, meaning that p = q. We thus introduce the following assumption 
on the stationary distribution of the hidden Markov chain: 

Assumption 2 p = q. 

Time reversibility assumption on the substitution process implies equality between the 
marginals of h and individual distributions of the letters, namely hx = / and hy = g. We 
thus also introduce the following assumption on the emission distributions: 

Assumption 3 hx = / and hy = g. 

This last assumption has an interesting consequence on the distribution of only one se- 
quence: 
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Lemma 2 Under Assumption 3, for any integers n and m, any x\ :n and any y\- 



in 



Fg (Z t = (n, m),Y 1:m = y 1:m 



Fg (Z t = (n, m) ,X 1:n = x 1 .. 



F e (Z t = (n,m))f® n (x 1:n ), 
Fg (Z t = (n,m))g® m (y 1:m ). 



Here, f m (x 1:n ) ± f( Xl ) . . . f(x n ). 



Proof 

One has 



Fg (Z t = (n, m),X 1:n = x lm ) 

= ^Fg (Z t = (n, m),X 1:n = x 1:n , Y 1:m = y 1:m ) 



^ Fg (ei-t = e, Xi- n = xi :n , Y 1:m = y x , m ) 



Fg (ei : t = e) Fg (X 1 . n = x 1:n , Y x . m = yi :m \si:t = e) 




so that use of equation (j2J and Assumption 3 gives the first assertion of the Lemma. Proof 
of the second assertion is similar. ■ 

3 Information divergence rates 

3.1 Definition of Information divergence rates. 

In this section, we investigate the asymptotic properties of the log-likelihoods £t(9) and wt(0) 
when properly normalized. We first prove that limiting functions exist. We shall need the 
following parameter sets Go and &s, 5 > 0: 

&s = {9e@\7c(i,j)>6, f(x)>6, g(y)>6, h(x,y)>5,Vi,je£,Vx,yeA}, 

= {9 G 9 | tt(», j) > 0, f(x) > 0, g(y) > 0, h(x, y) > 0, Vi, j G S , Vx, y G A} . 
We shall always assume that 9q G ®o- 
Theorem 1 JTie following holds for any 9 G 0o-' 

ij t~ 1 £t(9) converges Fo-almost surely and in Li, as i tends to infinity to 




ii) t l wt(9) converges F^-almost surely and in Li, as i tonds to infinity to 
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We then define Information divergence rates: 

Definition 1 V0 € O , 

D(0\d Q ) = w(6 ) - w(0) and D*(d\0 ) = £{6 ) - £(9). 

Note that D* is what is usually called the Information divergence rate in Information The- 
ory: it is the limit of the normalized Kullback-Leibler divergence between the distributions 
of the observations at the true parameter value and another parameter value. However, we 
also call D an Information divergence rate since Qg may be interpreted as a likelihood. 

Proof of Theorem ^ 

This proof follows the lines of Leroux ((1992), Theorem 2). We shall use the following 
version of the sub-additive ergodic Theorem due to Kingman (1968 ) to prove point i). A 
similar proof may be written for ii) and is left to the reader. 
Let (W Sj t)o< s <t be a sequence of random variables such that 

1. For all s < t, W , t > W , s + W s , t , 

2. For all k > 0, the joint distributions of (W s+ k,t+k)o<s<t are the same as those of 

{W Sj t)o<s<t, 

3. E (W 0A ) > -oo. 

Then limt-voo t~ Wo,t exists almost surely. If moreover the sequences {W s +k,t+k)k>0 are 
ergodic, then the limit is almost surely deterministic and equals sup t t~ 1 Eo(Wo,t)- If more- 
over Eo(Wo,t) ^ At, for some constant A, then the convergence holds in L]_. 
We apply this theorem to the auxiliary process 

W 8 ,t = maxlog¥g(X Ns+ i :Nt ,Y Ms+hMt \e s+ i = e) + log(^), < s < t, 

where 5q = min e e / e £ 7r(e, e') > 0. We are interested in the behaviour of 
U s>t = log¥ e (X N3+1:Ntl Y Ms +i:M t ), < s < t. 

Since we have exp(U S)t ) = J2ee£ F d(e s +i = e)F & (X Na+1 .. Nt ,Y Ma +i:M t \^s+i = e) leading to 
exp(W S; t — log Sq ) min e£ £- Fg(ei = e) < exp(U St t) < exp(W S) t — log So), we can conclude that 
the desired results on limt_ +00 t~~ 1 Uoj and lim^oo t^ 1 Eo(f7o,t) follow from corresponding 
ones on the process W. 
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Note that since Zq = (0,0) is deterministic, we have Wo,t = max ee £ logP^(Xi : jv i , Yi-.M t I 
E\ = e) + log (56i. Super-additivity (namely point 1.) follows since for any < s < t, 

fe(X 1:Nt ,Y 1:Mt \ei = ei) = ^ Pe(£2:t = e 2: t,Xi : Af t , Yi :Mt |ei = ei) 

ki=* 

> ^ ^ Pe(e2:s = e|. s ,£ s+ i : t = e 2 ,X 1:Nt ,Y 1:Mt \ei = e x ) 

e e£)v s ,Af s e e.Sjsr t -Ns,Mt-M s 
|e 1 |=s e 2 |=t— s 

= p e( e s+2:t = e| t _ s ,XAr s +i:Ar t ,iM s +l:M t ks+i = e?) 

e 1 G£]V s ,JV/ s e2 ^N t -N 3 ,M t -M 3 

|e 1 |=s e 2 |=i-s 

x 7r(e s ,e s+ i)P e (e 2 :s = 4:s> ^i:iv s) >i:AfJei = ei) 
= F e(^Af s +i:Af t ,>Af s +i:M t |es+i = e s+ i )7r(e s , e s+ i )P# (e a = e s , Xi :A r s , F 1:Ms |e x = ei) 

e s ,e s+ i££ 

> {maxP e (XAr s+ i : Ar t ,y Ms +i:Mt|e s +i = e')}{min7r(e,e')}P6»(^i:7V s ,^i:M s ki = ei) , 

e'gf e,e 

so that we get Wb,t > Wb, s + W s ,t, for any < s < t. 

To understand the distribution of {W a ,t)o<s<u note that W s ,t only depends on trajectories 
of the random walk going from the point (N s , M s ) to the point (iVj, Mt) with length t — s. 
Since the process (st)teN 1S stationary, one gets that the distribution of (W s> t) is the same 
as that of (W s+ k j t+k) fo r any k, so that point 2. holds. 
Point 3. comes from: 

E o (W o 7 1 )-logJ = Eomax{log/(X 1 );log 5 (y 1 );log/ i (^i,^i)}>-oo, 

Po-almost surely, since 9 G 0o- Let us fix < s < t. The proof that W s,t = (W s+ k,t+k)k>o 
is ergodic is the same as that of Leroux ((1992), Lemma 1). Let T be the shift operator, 
so that if u = (uk)k>o, the sequence Tu is defined by (Tu)k = (u)k+i for any k > 0. Let 
B be an event which is T-invariant. We need to prove that Po(W /s> * G B) equals or 1. 
For any integer n, there exists a cylinder set B n , depending only on the coordinates ut 
with — m n < k < m n for some sub-sequence m n , such that Po(P^ s '* G BAB„ Ln ) < l/2 n . 
Here, A denotes the symmetric difference between sets. Since W s ' 1 is stationary and B is 
T-invariant: 

Po (W s,t G BAB mn ) = Po (r 2nin W s,t G BAB mn ) 

= P {W^ G BAT- 2m "B mn ) . 

Let B = n n >i Uj>„r 2m 'B mj . Borel-Cantelli's Lemma leads to P (W S '* G BAB) = 0, 
so that P (W^'* G B) = V^W 8 '* € B) = P (W s ' t £ B D B). Now, conditional on (e t ) te ®, 
the random variables (W s +k,t+k)k>o are strongly mixing, so that the — 1 law implies 
(see (Suchcston 1963)) that for any fixed sequence e with values in E^, the probability 
W Q {W S ^ G B\(e t ) t = e) equals or 1, so that 



P (W^EB) =P((s t ) t eC) 
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where C is the set of sequences e such that P(W s,t G B\(et)t = e) = 1. But it is easy to 
see that C is T-invariant. Indeed, if e G C then, since W s > 1 is stationary and £> invariant, 

1 = P (W>* G B|( £t ) t = e ) = P (T^ S >' G B|(e t ) t = Te) = P (W^* G B\{e t ) t = Te) 

so that Te G C. Now, since a stationary irreducible Markov chain is ergodic, Po {(£t)t £ C) 

equals or 1. This concludes the proof of ergodicity of the sequence W s ' 1 . 

To end with, note that for any t > 0, the random variable Wo,t is non positive. ■ 

3.2 Divergence properties of Information divergence rates 

Information divergence rates should be non negative: this is proved below. They also 
should be positive for parameters that are different than the true one: we only prove it for 
some subsets of the parameter set. We thus define Q eX p as the subset of Go such that the 
expectations of e\ under 9 and under 6q are not aligned with (0, 0): 

Qex P = {0 G Go : VA > 0,E„(ei) + AE (ei)} . 

®marg is the subset of Go such that Assumption 3 holds: 

Qrnarg = {# G Q : h X = f, h Y = #} • 



Theorem 2 Information divergence rates satisfy: 

• For all 9 G Q , D(0\6 ) > and D*(6\9 ) > 0. 

• For any 9 G Q exp , 6 + , we have D(0\6 ) > and D*(9\6 ) > 0. 

• If 9q and 9 are in @ m arg, D(9\9q) > and D*(9\9q) > as soon as f ^ fo or g ^ go- 
Notice that in case Assumption 2 holds, the expectations of E\ under 9 and under 9q are 
aligned with (0,0). In this case, we were not able to prove that h ^ ho implies positivity 
of information divergence rates. 

Proof 

Since for all t, 

E (logP pfl:iV t ,yi:M t )) " E (log¥ e (X 1:Nt ,Y 1:Mt )) 

is a Kullback-Leibler divergence, it is non negative, and the limit D*(6\9q) is also non 
negative. 

Let us prove that D(9\9q) is also non negative. To compute the value of the expectation 
Eo[w>t(#)], introduce the set A t of all possible values of Z t : 

A t = { (n, to) G N 2 : n V to < t < n + to} . (12) 

Then, 

E [w t (9)]= ^ Z^j ^o(Z t = (n,m),X 1:n = x 1 : n ,Y 1:m = y 1:m )logQe(x 1 : n ,y 1:m ). 

(fl,m)6A( Zl:n,J/l:m 

(13) 



14 



A. Arribas-Gil, E. Gassiat and C. Matias 



Now, by definition, 



D(e\e )= lira lEo flog ^^^ 



By using Jensen's inequality, 



E o log - — < logEo 



\Qe (Xi;N t ,Yi;Mt) 



But 



Q8 (Xl;N t ,Yl:M t 

( Qe{Xl;N t ,Yl:Mt) \ 
\QOo{Xl:Nt,Yl : Mt) J 

^ ^o( z t = (n,m),Xi :n = x 1:n ,Y 1:m = y 1:m ) X 

(n,m)£A t xi:n,Vl : m 



E, 



:ni V\:n 

Qe ( x i 



(a) 



< } ] ^ Fe(3s > 1, Z s = (n,m),Xi :n = x 1:n , Y 1:m = y hm ) 

:m 

< n^s>l,Z s = (n,m)), 

(n,m)£At 

where (a) comes from expression (jlljl . Finally, 



lim - (wt(6) — wt(6o)) < liminf - log 

t— >+oo f t^+oo t 



Pfl(38>l,Z, = (n,m)) 

(n,m)6/l[ 



(14) 



But the cardinality of ^4t is at most £ 2 , so that 



lim - (w t (9) - w t (9 )) < liminf- log t 2 = 0, 

t— >+oo t t^+oo t 



and 



V0 G 9 , D(6\0 ) > 0. 
Since G 0o, there exists 5g such that 6* G 6<5 e . By using (jTTj) . one gets the lower bound 

Qe(x 1:n ,y 1:m ) >5% +m inf [F e {s 1:]e] = e)] . 

6 G C- tt, , 771 

Since trajectories e in £ n m have length at most n + m, 

inf [p, ( £l:|e| = e)] > 

Note also that if (n,m) belongs to At then we have n + m < 2t and n V m > i/2. Thus, 
uniformly with respect to (n,m) G A 4 and to x\- n and yi :m , 



Moreover, with 



At\og5 e < }ogQe(x 1:n ,y 1:m ) < 0. 
100 V \\g\\oo V ll/illoo < 1 - ^ < 1 



(15) 
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one has for any integers n, m, any x\- n and y\-. m 

Qe{xi:n,yi: m )<pf /m - 

In this case, for all t, and uniformly with respect to (n,m) 6 At and to x\- n and yi-.m, 

log Qe(x lm , yi-m) < - log(l - 5 ). (16) 

Inequalities (fT5|) and (|T6|) allow to conclude that 

-Ce < w(6 ) < -c 9o and - C e < w{9) < -c 9 . 
Then, as soon as Bt is a set such that 

lim F (Z t <£B t )=0, (17) 



we have 



D(0\9 O )= lim -E 

t^+oo t 



iog Q j o ^ 1:Nt, l 1:Mt } )nz t eB t } 



Now, using Jensen's inequality. 
Qe{Xv.NtiYi.Mt 



E, 



log- 



But as previously seen, 

Qtf(-^l:JVt>^l:Af f 



H{Z t G Sj 



Qe(-X"l:iV t ,il:Aft) 



<P (Z t GB t ) log E c 



Qe(^l:Ar t ,^l:Af t y 



Q6> (^l:iV t )il:M t ^ 



Z t G Sj 



E, 



Qe {X\;Nf>Y\ : Mt 

E E 



o(Z 4 = (n, m), Xi :n = xi :n , Yj. m = y 1;m ) Qe{xv.n, Dim) 

X 



£ E 

(n,m)gSt 



Pg(3g >1,Z S = (n,m)) 
MZt € B t ) 



Finally, 



-£>(0|0 O )< I™ -logP e (3s> 1,Z S GBt). 

t— >+oo I 



(18) 



Let us now consider the case where the expectations of E\ under parameters 9 and 9$ 
are not aligned with (0, 0), that is 9 G Q e xp- We have 

7?= inf ||E e (ei)-AE (ei)|| > 0, 



where || • || denotes the euclidean norm. Define 

(n, m) 



B, 



'n, m) G A t 



t 



E (ei) 



16 



A. Arribas-Gil, E. Gassiat and C. Matias 



Then, (|17j) holds. Any trajectory e ending at point (n, m) has length at least 11V m which 
is at least t/2 when (n,m) £ Bt- Thus for such (n,m): 



< 



< 







f 






> 




inf 






2 


AeK 




> 


f 


inf 






2 


AeR 



— - AEo(ei) 
s 

— - AEo(ei) 
s 



t 

< - 

s 



y-Eo(ei) 



3 ^2' 



--E*(ei) 
s 



>5 



Now, using easy Cramer-Chernoff bounds, since tt is irreducible, one has that there exists 
a positive c(rj) and some so > such that as soon as s > sq, 



Ee(ei) 



7? 

> - ) < exp(-sc(??)) , 



and by summing over s, there also exists a positive C such that for large enough t, 



3s > 



t 



Ee(ei) 



>|) <f«cp(-/rti/)/2) 



Thus, using (|18|). one obtains that for G 



£>(0|0 O ) > 4^ > o. 

Let us now consider the case where 6q and 6 are in Q marg . Then, using Jensen's 
Inequality and definition (|llj). 



E log 



<56»o(^l:iVt ; ^l:M^ 

= E > , > , Po(^t = (n, m),X 1:n = xi :n , Yi :m = yi- m ) log 

: m 

< E ^2^o( Z t = (n,m),Xi :n = Xi :n ) 



Qe(xi:n,yi:m) 
Qe {xi:n,yi:m) 



(n,m)aAt x l:n 



bg E 



Wl-.m 



%(Z t = (n,m),X 1:n = xi :n ,Y 1:m = yy. m )Qe{xi: n ,yi:m) \ 
F Q (Z t = (n,m),Xu n = xu n )Qe (xu n ,yi :m ) J 

' e {3s>l,Z s = {n,m))f® n {x 1: 



^ E E P 0( Z ' = ( n ' m ))^ n (^l:n)lo. 
(n,m)eA t x l 



MZt = (n,m))f® n ( Xl .. n ) 



where the last inequality comes from Lemma, [2| and the fact that 1Pq(.^ — {ri^Tn?j^X\- n — 

Xl:n,Yl- m = yi- m ) < Qe {xi:n,yi:m)- 

Thus, since t~ l N t tends to (1 - p), Po-a.s. as t tends to infinity, and (1 — p) > since 
G 00, we have 



D(6\e ) < limsup- 



+ 



£ ¥ o(Z t = (n,m) ) \ lo. 

(n,m)£A t ,n>^Ztt 
(1-P) 



i(3s >1,Z S = (n,m)) 
F (Z t = (n,m)) 



fix) 
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as soon as / ^ fo ■ A similar proof applies if g ^ go . 

Proofs of divergence properties for D* follow the same lines. ■ 

3.3 Continuity properties 

On 0,5, the log-likelihoods are uniformly equicontinuous, with a modulus of continuity that 
does not depend on trajectories, as appears in the proof of the following Lemma. 

Lemma 3 The families of functions {t~ 1 wt(6)}t>i and {t £t(6)}t>i are uniformly equicon- 
tinuous on (3$. 

A consequence of this Lemma and the compactness of ©5 is: 
Corollary 1 The following holds: 

i) {i~ 1 iut(#)}t (resp. {t~ 1 £ t (6)}t) converges Fo-almost surely to w(6) (resp. to £(0)) 
uniformly on 0$; 

ii) £(9) and w(9) are uniformly continuous on 0$. 

Proof of Lemma El 

Let a > 0, and 61,62 £ ©<s such that ||#i — ^lloo < ex. 

Let us denote fj,g i} nd i , fe i , ge i and hg i the parameters of the hidden Markov chain and of 
the emission distributions under 6i, i = 1,2. 
For any e G £-N t ,Mt '■ 

1 'logP^Oi;^! = e,X 1:Nt ,Y 1:Mt ) - logPfl 2 (ei : | e | = e,X l:Nt ,Yi :Mt )\ 



t 



< - |log^ 1 (ei)-log^ 2 (ei)|+ - ^2 ( y Y2Me i - 1 =k,e i =lfj jlog ne 1 (k, Z) -log n$ 2 (k,l)\ 

k,le£ «=2 

+ -E (E n ^ = °)> x ^ = a >) 1 lo s fa ( a ) - lo § fa (°) 1 

aeA ^ i=l 



^2 H e i = (0, l),Y Mi = a}) I log ge 1 (a) - log gg 2 (a) 



i=l 



+ 7 ^2 (^H e i = i 1 ^),^^ = a,Y Ml = a'}\ \logh dl (a,a') -loghg 2 (a,a')\. 



t 

a,a'&A i=l 

In this sum, at most 2|e| terms are non null. Since all the components of 6i, i = 1,2 are 
bounded below by 5 and \\6i — 62W00 < we have : 

1 2 1 e I a 

-|logP ei (e 1: | e | = e,Xi :Nt ,Y 1:Mt ) - logPe 2 (e 1: | e | = e,X l:Nt ,Y v , Mt )\ < —j- 

But for any e € £N t ,M t , we have |e| < 2t, so that 

1 4a 

-|logP 01 (ei : | e | = e,X 1:Nt ,Y 1:Mt ) - logP 02 (£ 1: | e | = e,Xi :JVt , Y 1:Aft )| < — , 
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as soon as \\6\ — #2||oo < &■ 
Now we get 

Qe 1 (Xl:N t ,Yl:Mt) = ^01 (El:|e| = e > -^l:JVt 3 ^l:Aft) 



< 



expj^tj P ^( e l:|e| = e,X 1:Nt ,Yi :Mt ] 



e&£-N t ,M t 



(4a ] 

< exp <^ —t > Qe 2 (X 1:Nt ,Y 1:Mt ), 

and t \ogQ0 1 (Xi : N t ,Yi : Mt) — 4a/<5 + i" 1 log QQ 2 (Xi : w t , Y\-M t )- Since this is symmetric in 
9\ and #2> ° ne obtains that for any 61,62 £ @$ such that \\6\ — #2||oo < 

1^!) _ 1^(^)1 <^!. 

The same proof applies to t~ l l t - ■ 



4 Statistical properties of estimators 

We now want to focus on a particular form of the pair-HMM, relying on a re-parametrization 
of the model. Indeed, the pair-HMM has been introduced to take into account evolutionary 
events. The corresponding evolutionary parameters are the ones of interest and practitioners 
aim at estimating those parameters rather than the full pair-HMM. Examples of such 
re-parametrization may be found for instance in ( Thorne, Kishino, and Felsenstein 1991 



Thorne, Kishino, and Felsenstein 1992D (see also Section 5 of this paper). Let j3 1— > 0(j3) be 



a continuous parametrization from some set B to 0. For any 5 > 0, let B$ = 6~ 1 (@$). We 
assume that /3q = 6^ 1 (6q) in B$ for some 5 > 0. Use of pair-HMM algorithms to estimate 
evolutionary parameters corresponds to the estimator 

Definition 2 

fa = Argmax w t {6((3)). 
peB s 

Then, 

Theorem 3 If the set of maximizers of w(6(f3)) over B$ reduces to {Po}> Pt converges 
Po- almost surely to /3q. 

The proof of this theorem follows from Corollary ^ and usual arguments for M-estimators. 
The condition that the set of maximizers of w(6{j3)) over B$ reduces to {/?o} corresponds 
to some identifiability condition and thus may not be avoided. 



Another interesting approach to sequence alignment by pair-HMMs is to consider a 
non-informative prior distribution on the parameters to produce, via a MCMC procedure, 
the posterior distribution of the alignments and parameters given the observed sequences. 
Using Qg as the likelihood of the observed sequences produces a posterior distribution as 
follows. Let v be a prior probability measure on B$ and (5 a random vector distributed 



Pair-HMM parameter estimation 



19 



according to v. MCMC algorithms approximate the random distribution v\x 1 . Nt .Y 1 . 
preted as the posterior measure given observations X\-N t and Y\-M t '- 

Qe(f3)( X l:N t ,Yl:M t )v(dP) 



inter- 



(19) 



Ib s Qe{f3)( x ^Nti Y i-.M t )v{df3) ' 

This leads to Bayesian consistent estimation of /3q as in classical statistical models (see 
(Ibragimov and Has'minskii 1981 ) for instance). Notice that since wt is not the logarithm of 
a probability distribution on the observation space, these results are not direct consequences 
of classical ones. Though, the proof follows classical ideas of Bayesian theory. 

Theorem 4 If the set of maximizers of w{6((3)) over reduces to {Po}, and if v weights 
Po, then the sequence of posterior measures y\x VN ,Yi-m conver 9 es ^ n distribution Po- almost 
surely to the Dirac mass at (3$. 

Proof Let m : B$ — > R be any continuous, bounded function. For any e > 0, let a such 



that \m((3) — m(f3')\ < e as soon as 

Qe(J3)(Xl:Nt>Yl:Mt) 



J 

Jb s 



m(J3) 



Ib s Qe((3)(Xi:N t ,Y v . Mt )v(d(3) 



P'\\ < a. We have 
v(d0) - m{fio) 



< 



Ib s \ m (P) ~ m (Po)\Qe(/3)(X v . Nt ,Y 1 . Mt )v(df3) 



But 



so that 



Ib s Qe(i3)(X 1:Nt ,Y 1:Mt )v(d(3) 
hp-M\<a l m (^) ~ m (Po)\Qe(p)(Xi:N t ,Y 1:Mt )i'(df3) 



< e 



< e + 2\\m 



e + 2\\m\ 



Ib s Qe(l3)( X l:Nt, Y l:M t )v(dp) 
Q0U3)(Xi:Nt,Y 1:Mt ) 

- r — — ——- v {dp) - m(Po) 

I\\/3-l3o\\>aQeW) { X l:Nt,Yl:Mt)v(d(3) 
Ib 5 Qd(f3) ( X l:N t , Yi :Mt )v(df3) 

' L^ V {t{\ Wt {9(P)))}v(dp) 



Use of Corollary ^ and the fact that the set of maximizers of w{6{(3)) over B$ reduces to 
{A)} gives 7/ > and T such that for t > T and ||/?-/?o|| > a, t^w^iP)) - t~ l w t {9{p )) < 



—7], and then there exists 7 > such that for t > T and 
rVWo)) > -I Then 



Po\\ < 7, t^wMP)) 



ill, 



-M>a 



exp{t(} Wt (9(P)))}u(dp) 



< 



< 



f Bs exp{t(lw t (9(P)))}u(dp) 

I\\p- M>a exp {t {\w t {6{f3)) - \w t (9(Po))) } v(dP) 
! w _ M ^^ V {t{\w t (9{P)) - \w t (9{P,)))}v(dP) 



cxp 



HI) 



-A)||<7 



v{dp) 
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Using that v weights 0q we finally obtain 

Qe{p)(Xl:N t iYv.Mt 



lim 

t— >oo 



m(p) 



P - a.s. (20) 



But it exists a countable collection of continuous and bounded functions that are determin- 
ing for convergence in distribution and the union of the corresponding null sets in which 
(12011 does not hold is still a null set. Then 



a.s. 



(21) 



5 Simulations 

5.1 A simple model 

For the whole simulation procedure we consider the following substitution model: 

h( XV )-{ f{x){l - e ~ a)f{y) '*** V (22) 

[ V> 1 f(x){{l-e- a )f(x)+e- a } otherwise, { ' 

where a > is called the substitution rate and for every letter x, f(x) equals the equilibrium 
probability of x. This equilibrium probability distribution is assumed to be known and will 
not be part of the parameter. Here, the emission distribution g equals /, and Assumption 
3 holds. The unknown parameter is thus (3 = (tt, a). This is a classical substitution model 
(used for instance in ( Thorne, Kishino, and Felsenstein 1991| )) where the substitution rate 



is independent of the type of nucleotide being replaced and 1— e~ a represents the probability 
that a substitution occurs. We shall consider hidden Markov chains that satisfy Assumption 
2, and will present: 

• Simulations with i.i.d. (e s ) s where probabilities of horizontal or vertical moves equal 
Po and probability of diagonal moves equals tq = 1 — 2pQ. Here, the parameter reduces 
to P = (p, a). 

• Simulations with stationary Markov chains such that po = qo- The parameter dimen- 
sion then reduces to 6 (including a). 

Notice that none of these situations is covered by Theorem |5| we do not know in those 
cases whether the information divergence rates are positive at a parameter value different 
from the true one. 

In both cases, we get estimations of the parameters via MLE (taking Qg as the likelihood 
as it is done in practice), and in the i.i.d. case we compute and compare the functions w 
and I. 

5.2 Simulations with i.i.d. fe,)« 



We have simulated 200 alignments of length 15000 with substitution rate oq = 0.05 and 
Po = Qo = 0.25. We have set the equilibrium probability of every nucleotide to 0.25. We 
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show in FigureElhistograms for the maximum likelihood estimations of both parameters. In 
a first part we keep a fixed at a® and estimate p and then we keep p fixed at po and estimate 
a. That produces good estimations of the parameters even if a is a bit underestimated. 
However when estimating p and a simultaneously (second part) we obtain no satisfying 
results especially on a (see Figure OJ). 




0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 



Figure 3: Histograms of maximum likelihood estimations of parameters obtained with 200 
simulations from the i.i.d. model. On the left: estimation of p given a = ao and estimation 
of a given p = p$. On the right: joint estimation of p and a. 

That can be explained by looking at the graph of w(@) and comparing it to £(f3) (Figure 
SJ. We see that both w and £ are very flat with respect to a and as we deal with numerical 
precision errors, finding out the true maximum value becomes impossible. However, for 
p = po if we look closely at the cuts of £ and w we appreciate that i takes its maximum on 
olq and w near this point. As the maximisation problem complexity is reduced in this case 
we are able to find a quite good estimation for a. Concerning p, we see that both £ and 
w have a clear maximum near po, but again £ is less flat than w at this point. This is not 
surprising since £ really is the information divergence rate of the model. 

5.3 Simulations with Markov chains satisfying Assumption 2 

We have simulated 200 alignments of length 15000 with substitution rate ao = 0.05 and 
the following transition matrix for (e s ) s 





D 


H 


V 




( 0.7 


0.2 


0.1 


* 1 


0.3 


0.5 


0.2 


V 


V 0.3 


0.1 


0.6 



with initial distribution po = qo = 0.25. We have set as free parameters ithh, ^hv, ^dv, 
ttw and 7TDH- The equilibrium probability of every nucleotide is again fixed to 0.25. We 
can observe in Figure [5] that the maximum likelihood estimators for these parameters and 
for a are close to their true values even when the estimation is done jointly. These results are 
rather encouraging since the Markov case is the interesting one in biological applications. 
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Figure 4: On top: i and w for the i.i.d. model (po = 0.25, ao = 0.05). On bottom: cuts of 
£ and w for a = ao fixed and for p = po fixed. 
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